Clustering Stability: An Overview
A popular method for selecting the number of clusters is based on stability
arguments: one chooses the number of clusters such that the corresponding
clustering results are "most stable". In recent years, a series of papers has
analyzed the behavior of this method from a theoretical point of view. However,
the results are very technical and difficult to interpret for non-experts. In
this paper we give a high-level overview about the existing literature on
clustering stability. In addition to presenting the results in a slightly
informal but accessible way, we relate them to each other and discuss their
different implications.
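The stability idea sketched in this abstract can be illustrated in a few lines. The sketch below is an assumption-laden toy version, not the protocol analyzed in the paper: it uses two k-means runs with random seeding as the "perturbation" (real protocols typically subsample the data) and measures agreement of co-membership matrices; the function names `comembership` and `stability_score` are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def comembership(labels):
    # n x n matrix: 1 where two points share a cluster, 0 otherwise
    return (labels[:, None] == labels[None, :]).astype(float)

def stability_score(X, k, n_pairs=10, rng=None):
    """Average agreement of co-membership matrices over pairs of
    k-means runs with random seeding (illustrative toy version)."""
    rng = np.random.default_rng(rng)
    scores = []
    for _ in range(n_pairs):
        # two independent runs act as the "perturbation" here;
        # real stability protocols subsample or add noise instead
        _, l1 = kmeans2(X, k, minit='++', seed=rng)
        _, l2 = kmeans2(X, k, minit='++', seed=rng)
        scores.append((comembership(l1) == comembership(l2)).mean())
    return float(np.mean(scores))
```

One would compute this score for each candidate k and pick the k with the most stable results; the literature surveyed in the paper discusses when and why this heuristic can fail.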
Kernel functions based on triplet comparisons
Given only information in the form of similarity triplets "Object A is more
similar to object B than to object C" about a data set, we propose two ways of
defining a kernel function on the data set. While previous approaches construct
a low-dimensional Euclidean embedding of the data set that reflects the given
similarity triplets, we aim at defining kernel functions that correspond to
high-dimensional embeddings. These kernel functions can subsequently be used to
apply any kernel method to the data set.
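One natural construction in this spirit (an illustration only, not necessarily either of the paper's two kernels) represents each object by its ±1 answers to sampled triplet questions and takes the linear kernel on these comparison features; the resulting Gram matrix is positive semidefinite by construction.

```python
import numpy as np

def triplet_features(D, pairs):
    """Embed each object x as a +/-1 vector answering 'is x closer to b
    than to c?' for sampled pairs (b, c). D is a full distance matrix;
    with real triplet data, observed comparisons would replace D[x,b] < D[x,c]."""
    feats = np.array([[1.0 if D[x, b] < D[x, c] else -1.0
                       for (b, c) in pairs]
                      for x in range(D.shape[0])])
    return feats / np.sqrt(len(pairs))   # scale so k(x, x) = 1

def triplet_kernel(D, n_pairs=200, rng=None):
    """Linear kernel on the comparison features (hypothetical construction)."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    pairs = [tuple(rng.choice(n, size=2, replace=False))
             for _ in range(n_pairs)]
    F = triplet_features(D, pairs)
    return F @ F.T
```

The kernel matrix can then be handed to any kernel method (SVMs, kernel PCA, kernel k-means) without ever constructing a Euclidean embedding of the objects.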
Shortest path distance in random k-nearest neighbor graphs
Consider a weighted or unweighted k-nearest neighbor graph that has been
built on n data points drawn randomly according to some density p on R^d. We
study the convergence of the shortest path distance in such graphs as the
sample size tends to infinity. We prove that for unweighted kNN graphs, this
distance converges to an unpleasant distance function on the underlying space
whose properties are detrimental to machine learning. We also study the
behavior of the shortest path distance in weighted kNN graphs. Comment: Appears
in Proceedings of the 29th International Conference on Machine Learning (ICML
2012).
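The objects studied here are easy to construct with scipy; the sketch below builds symmetric weighted and unweighted kNN graphs and computes all-pairs shortest paths. It only sets up the graphs the abstract refers to, with nothing of the paper's limit analysis.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def knn_graph(X, k, weighted=True):
    """Symmetric k-nearest-neighbor graph; edge weights are Euclidean
    edge lengths (weighted) or 1 (unweighted)."""
    D = cdist(X, X)
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # skip the point itself
        W[i, nbrs] = D[i, nbrs] if weighted else 1.0
    W = np.maximum(W, W.T)                 # symmetrize
    return csr_matrix(W)

X = np.random.default_rng(0).normal(size=(200, 2))
d_w = shortest_path(knn_graph(X, k=10, weighted=True))
d_u = shortest_path(knn_graph(X, k=10, weighted=False))
# d_u counts hops; d_w approximates geodesic distance as the sample grows
```

The contrast between `d_u` and `d_w` is exactly the point of the paper: the hop-count distance of unweighted graphs behaves badly in the large-sample limit, while the weighted version is better behaved.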
A Tutorial on Spectral Clustering
In recent years, spectral clustering has become one of the most popular
modern clustering algorithms. It is simple to implement, can be solved
efficiently by standard linear algebra software, and very often outperforms
traditional clustering algorithms such as the k-means algorithm. At first
glance, spectral clustering appears slightly mysterious, and it is not obvious
to see why it works at all and what it really does. The goal of this tutorial
is to give some intuition on those questions. We describe different graph
Laplacians and their basic properties, present the most common spectral
clustering algorithms, and derive those algorithms from scratch by several
different approaches. Advantages and disadvantages of the different spectral
clustering algorithms are discussed.
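The basic recipe covered in the tutorial fits in a few lines. The sketch below is a minimal unnormalized variant under assumed choices (a fully connected Gaussian similarity graph with bandwidth `sigma`, and scipy's `kmeans2` for the final step); the tutorial itself treats several graph constructions and normalizations.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma=1.0):
    """Unnormalized spectral clustering: Gaussian similarity graph,
    eigenvectors of the k smallest eigenvalues of L = D - W, then k-means."""
    W = np.exp(-cdist(X, X) ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    L = np.diag(d) - W
    # rows of U are the spectral embedding of the data points
    _, U = eigh(L, subset_by_index=[0, k - 1])
    _, labels = kmeans2(U, k, minit='++', seed=0)
    return labels
```

The "mystery" the tutorial unpacks is why k-means on the rows of `U` succeeds where k-means on the raw data fails: for well-separated clusters the embedding is nearly piecewise constant on the clusters.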
Explaining the Explainer: A First Theoretical Analysis of LIME
Machine learning is used more and more often for sensitive applications,
sometimes replacing humans in critical decision-making processes. As such,
interpretability of these algorithms is a pressing need. One popular algorithm
to provide interpretability is LIME (Local Interpretable Model-Agnostic
Explanation). In this paper, we provide the first theoretical analysis of LIME.
We derive closed-form expressions for the coefficients of the interpretable
model when the function to explain is linear. The good news is that these
coefficients are proportional to the gradient of the function to explain: LIME
indeed discovers meaningful features. However, our analysis also reveals that
poor choices of parameters can lead LIME to miss important features. Comment:
Accepted to AISTATS 202
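The linear case analyzed in the paper can be checked numerically. The sketch below is a stripped-down LIME-style surrogate under assumed choices (Gaussian perturbations and an RBF proximity kernel with bandwidth `nu`; the real LIME uses interpretable binary features and other sampling schemes): for a linear function, the weighted least-squares surrogate recovers the gradient exactly, as the abstract states.

```python
import numpy as np

def lime_linear_surrogate(f, x, n_samples=2000, nu=1.0, rng=0):
    """Fit a locally weighted linear surrogate to f around x (LIME-style
    sketch). Perturbations are Gaussian; weights come from an RBF kernel."""
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    Z = x + rng.normal(size=(n_samples, d))            # perturbed samples
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * nu ** 2))
    A = np.hstack([np.ones((n_samples, 1)), Z])        # intercept + features
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], f(Z) * sw, rcond=None)
    return coef[1:]                                    # surrogate slopes

# for a linear f, the surrogate slopes equal the true coefficients
beta = np.array([2.0, -1.0, 0.5])
f = lambda Z: Z @ beta
slopes = lime_linear_surrogate(f, np.zeros(3))
```

The paper's negative result corresponds to what happens when the kernel bandwidth or sampling parameters are chosen badly: the weighted fit can then assign near-zero coefficients to features that genuinely matter.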
Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions
Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate "small" function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce "nearest neighbor clustering". Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions.
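The restricted function class described here admits a very direct (if naive) implementation: only labelings induced by a small set of seed points via nearest-seed assignment are considered, and the best one under the chosen objective wins. The sketch below is an illustrative brute-force version over all k^m seed labelings with within-cluster sum of squares as an example objective; the function name `nn_clustering` and all parameter choices are assumptions.

```python
import numpy as np
from itertools import product
from scipy.spatial.distance import cdist

def nn_clustering(X, k, m=6, objective=None, rng=0):
    """Search all labelings of m seed points, extend each to all of X by
    nearest-seed assignment, and return the labeling with the lowest
    objective. Brute force over k**m candidates -- illustrative only."""
    if objective is None:
        def objective(X, labels):  # within-cluster sum of squares
            return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                       for c in set(labels))
    rng = np.random.default_rng(rng)
    seeds = rng.choice(len(X), size=m, replace=False)
    nearest = cdist(X, X[seeds]).argmin(axis=1)  # index of nearest seed
    best, best_labels = np.inf, None
    for seed_labels in product(range(k), repeat=m):
        labels = np.array(seed_labels)[nearest]
        if len(set(labels)) < k:  # skip labelings not using all k clusters
            continue
        val = objective(X, labels)
        if val < best:
            best, best_labels = val, labels
    return best_labels
```

Because any objective function can be plugged in and the candidate class is small, this mirrors the "baseline" role the abstract describes, analogous to the k-NN classifier in supervised learning.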
Graph Laplacians and their convergence on random neighborhood graphs
Given a sample from a probability measure with support on a submanifold in
Euclidean space one can construct a neighborhood graph which can be seen as an
approximation of the submanifold. The graph Laplacian of such a graph is used
in several machine learning methods like semi-supervised learning,
dimensionality reduction and clustering. In this paper we determine the
pointwise limit of three different graph Laplacians used in the literature as
the sample size increases and the neighborhood size approaches zero. We show
that for a uniform measure on the submanifold all graph Laplacians have the
same limit up to constants. However in the case of a non-uniform measure on the
submanifold only the so called random walk graph Laplacian converges to the
weighted Laplace-Beltrami operator. Comment: Improved presentation, typos
corrected, to appear in JML
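The three graph Laplacians compared in this line of work can be written down directly from a weight matrix; the sketch below uses the standard definitions (unnormalized, random walk, and symmetric normalized), assuming a connected graph so that all degrees are positive.

```python
import numpy as np

def graph_laplacians(W):
    """Return the three standard graph Laplacians of a weight matrix W:
    unnormalized L = D - W, random walk L_rw = I - D^{-1} W, and
    symmetric L_sym = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)                     # degrees
    L = np.diag(d) - W
    L_rw = np.eye(len(W)) - W / d[:, None]
    d_is = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_is[:, None] * W * d_is[None, :]
    return L, L_rw, L_sym
```

The paper's result concerns the large-sample limit of these operators on neighborhood graphs: only the random walk version converges to the weighted Laplace-Beltrami operator when the sampling density is non-uniform.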